Regular Expressions

The Find and Replace dialog boxes support search criteria with basic regular expressions. To enable this mode, just put a checkmark in the "Regular Expression" checkbox. You can also specify regular expressions with the Find and Replace Clip commands by using the "R" option.

Those familiar with UNIX text utilities will find that the engine is somewhere between EGREP and AWK with regard to the complexity of problems that it can solve.

Find Patterns

Regular expression patterns are composed of the following:

Period (.)
Matches any single character except newline. Internally, a newline is really two characters in a specific order: <carriage return> followed by <linefeed>. To match a newline, you must specify it explicitly with \n.

Caret (^)
Matches at the beginning of a line only. A ^ occurring ANYWHERE in the match expression (except within a character class) is interpreted in this manner. This allows meaningful use of ^ in combination with grouping or alternation (see below).

Dollar sign ($)
Matches at the end of a line only. As with ^ the $ character retains its special meaning anywhere within the expression (except in a character class).

Backslash (\)
Followed by a single character matches that character. For example, '\*' matches an asterisk, '\\' matches a backslash, '\$' matches a dollar sign, etc.

The following sequences have special meaning:
\s   space (ASCII #32)
\t   tab (ASCII #9)
\r   return (ASCII #13)
\l   linefeed (ASCII #10)
\n   newline (#13 followed by #10)
\f   formfeed (ASCII #12)
\p   pipe character |
\w   any word delimiter. Matches any of \t\s!"&()*+,-./:;<=>?@[\]^`{|}~
\W   any nonword delimiter. Equivalent to [^\t\s!"&()*+,-./:;<=>?@[\]^`{|}~]
\h   any hex character. Matches any of 0123456789ABCDEF
\H   any nonhex character. Equivalent to [^0123456789ABCDEF]
\a   any character, including carriage return, linefeed, formfeed, etc.
\b   any blank (white) space including space, tab, formfeed, etc. Equivalent to [\s\t\f\n\r]
\B   any nonwhite space character. Equivalent to [^\s\t\f\n\r]
\d   any digit character. Equivalent to [0-9]
\D   any nondigit character. Equivalent to [^0-9]

Case is ALWAYS significant when using the special characters. Thus \s matches a space while \S matches a capital letter S.
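
For readers who also work with other regex tools, the sketch below shows one way some of the sequences above could be rewritten for Python's re module. It is only an illustration: the mapping table and the translate helper are hypothetical names invented here, \w, \W and \a are left out, and escapes inside character classes are not handled. Python matches case-sensitively by default, which the second check relies on.

```python
import re

# Hypothetical mapping from the escapes listed above to Python equivalents.
NOTETAB_TO_PYTHON = {
    r"\s": " ",              # space only (Python's own \s means any whitespace)
    r"\t": r"\t",
    r"\r": r"\r",
    r"\l": r"\n",            # linefeed
    r"\n": r"\r\n",          # newline = carriage return + linefeed
    r"\f": r"\f",
    r"\p": r"\|",
    r"\h": "[0-9A-F]",
    r"\H": "[^0-9A-F]",
    r"\d": "[0-9]",
    r"\D": "[^0-9]",
    r"\b": r"[ \t\f\r\n]",
    r"\B": r"[^ \t\f\r\n]",
}

def translate(pattern):
    """Rewrite two-character escapes; any other escaped character becomes a literal."""
    out = []
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair in NOTETAB_TO_PYTHON:
            out.append(NOTETAB_TO_PYTHON[pair])
            i += 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))  # e.g. \S is just a capital S
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

hex_byte = re.compile(translate(r"\h\h"))
print(bool(hex_byte.fullmatch("4F")))   # True
print(bool(hex_byte.fullmatch("4f")))   # False: \h lists uppercase A-F only
print(translate(r"\S"))                 # S
```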

A single character not otherwise endowed with special meaning matches that character. Thus z matches a single instance of the letter z.

A string enclosed in brackets [] specifies a character class. Any single character in the string is matched. For example, [abc] matches an a, b, or c. Ranges of ASCII letters and numbers can be abbreviated as, for example, [a-z0-9]. If the first symbol following the [ is a caret (^) then a negative character class is specified. In this case, the string matches all characters EXCEPT those enclosed in the brackets. For example, [^a-z] matches everything except lower case characters (and newlines).

The special characters defined above may be used inside of character classes with the exception of \n, \w and \h, which are shorthand for their own character classes. If the characters - or ] are to be used literally inside of a character class, they should be preceded by the escape character \. Note that *?+(){}!^$#& are not special characters when found inside a character class.
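
Character classes behave much the same way in most regex tools. The sketch below uses Python's re module purely as an illustration; it is not the NoteTab engine, and details such as case sensitivity and newline handling may differ.

```python
import re

# [abc] matches any one of the listed characters.
simple = re.findall(r"[abc]", "cabbage")

# Ranges: lowercase letters and digits.
ranged = re.findall(r"[a-z0-9]", "Q7 x!")

# Negated class: everything except lowercase letters.
negated = re.findall(r"[^a-z]", "aB3-z")

# A literal - or ] is escaped with a backslash.
literal = re.findall(r"[\-\]]", "a-b]c")

# Operators such as * ( + ) lose their special meaning inside a class.
plain = re.findall(r"[*?+(){}!^$#&]", "2*(3+4)")

print(simple)   # ['c', 'a', 'b', 'b', 'a']
print(ranged)   # ['7', 'x']
print(negated)  # ['B', '3', '-']
print(literal)  # ['-', ']']
print(plain)    # ['*', '(', '+', ')']
```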

Seeking Closure

A regular expression followed by * matches zero or more occurrences of that expression. This is referred to as a closure. Thus ba*b matches the string bb (no instances of a), bab (one instance), or baaaaaab (several instances).

A regular expression followed by + matches one or more occurrences of that expression. This is another type of closure. In this case ba+b will not match bb, but it will match bab or baaaaaab.

A regular expression followed by ? matches zero or one occurrence of that expression. This is another closure. Here, ba?b will match bb or bab, but not baaaaaab.
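
The three closures can be checked side by side. This is a minimal sketch using Python's re module, where *, + and ? have the same meaning for these simple patterns; the accepted helper is a name made up for this illustration.

```python
import re

SAMPLES = ["bb", "bab", "baaaaaab"]

def accepted(pattern):
    """Return the sample strings that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]

print(accepted("ba*b"))  # ['bb', 'bab', 'baaaaaab']
print(accepted("ba+b"))  # ['bab', 'baaaaaab']
print(accepted("ba?b"))  # ['bb', 'bab']
```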

Important: unfortunately, the ? closure does not always work correctly. Depending on the pattern, it sometimes fails when there are zero occurrences of the preceding item and the closure follows a regexp token (e.g., ".?", "\d?"). This problem is deep in the regexp engine, which is developed by a third-party company. We plan to use a new regexp engine based on Perl 5 in the next major update of NoteTab.

Concatenated Expressions

Two concatenated regular expressions match text that matches the first, immediately followed by text that matches the second. Thus (abc)(def) matches the string abcdef.

Alternation

Two regular expressions separated by | match either a match of the first or a match of the second. This is referred to as alternation. Any number of regular expressions can be strung together in this way. Alternation matches are tested in order from left to right, and the first match obtained is used. Then the remaining alternate expressions are skipped over.
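
Left-to-right testing means the order of the alternates can change the result. The sketch below shows this with Python's re module, which also tries alternates in order; it illustrates the general idea only and is not a statement about what the NoteTab engine returns for these patterns.

```python
import re

text = "abc"

# The first alternate that succeeds is used, even if a later one is longer.
first = re.match(r"(ab)|(abc)", text)
second = re.match(r"(abc)|(ab)", text)

print(first.group(0))   # ab
print(second.group(0))  # abc
```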

Grouping Expressions

A regular expression enclosed in parentheses () matches whatever the enclosed expression matches. Parentheses are used to provide grouping, and may be nested to arbitrary depth. Open and close parentheses must be balanced. For example, the following two expressions are not equivalent, and the second probably expresses what was intended:

PROCEDURE|FUNCTION

(PROCEDURE)|(FUNCTION)

The first expression is equivalent to

PROCEDUR(E|F)UNCTION

The second expression matches either of the two words.
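
The difference between the two readings can be made concrete. The sketch below uses Python's re module and writes both readings with explicit parentheses, because Python itself gives concatenation priority over | and would read the unparenthesized PROCEDURE|FUNCTION as "either word". The variable names are made up for this illustration.

```python
import re

# How the engine described here reads PROCEDURE|FUNCTION.
engine_reading = re.compile(r"PROCEDUR(E|F)UNCTION")

# What was probably intended.
intended = re.compile(r"(PROCEDURE)|(FUNCTION)")

WORDS = ["PROCEDURE", "FUNCTION", "PROCEDUREUNCTION", "PROCEDURFUNCTION"]

engine_hits = [w for w in WORDS if engine_reading.fullmatch(w)]
intended_hits = [w for w in WORDS if intended.fullmatch(w)]

print(engine_hits)    # ['PROCEDUREUNCTION', 'PROCEDURFUNCTION']
print(intended_hits)  # ['PROCEDURE', 'FUNCTION']
```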

Tagged Matches

A regular expression enclosed in curly braces {} forms a tagged match word. Whatever was matched within the braces may be referred to by a Replace expression in a manner to be described. Tagged match words may not be nested. Open and close braces must be balanced. A maximum of nine tagged match words can be referenced by the Replace expression. For example, consider the expression:

b{a*}b

If the string being tested is "bab", then the tagged match word contains a single "a". If the string being tested is "baaaaaab", then the tagged match word contains "aaaaaa". If the string tested is "bb", then the tagged match word is empty.
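
A tagged match word corresponds to what many other tools call a capture group. As a rough analogy only, Python's re module writes the same idea with parentheses instead of braces; the tagged helper is a name made up for this sketch.

```python
import re

# Python analogue of b{a*}b: parentheses capture, braces are not used for tagging.
pattern = re.compile(r"b(a*)b")

def tagged(text):
    """Return the text captured by the first group, or None if nothing matches."""
    m = pattern.fullmatch(text)
    return m.group(1) if m else None

print(repr(tagged("bab")))       # 'a'
print(repr(tagged("baaaaaab")))  # 'aaaaaa'
print(repr(tagged("bb")))        # ''
```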

Order of Precedence

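Regular expressions are interpreted from left to right. The order of precedence of operators at the same parenthesis level is [], then the closures *+?, then |, and then concatenation. Because | binds more tightly than concatenation, alternates longer than a single character must be enclosed in parentheses.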

Tag braces are interpreted strictly from left to right and do not control precedence in any way. The first tagged match word found is given a tag of 1, the second a tag of 2, and so on up to a maximum tag of 9. The tag number that each word receives is based on when it is encountered in the line. If tags are skipped over as a result of alternation, then any remaining tags in a line receive shifted tag numbers. For example, consider the expression:

(FUNCTION)|({PROCEDURE})\s+{[^\s(]+}

If a line contains the word PROCEDURE then the word following PROCEDURE has a tag number of 2. If a line contains the word FUNCTION, then the word following FUNCTION has a tag number of 1. It is up to the user to take advantage of this behavior. Generally, it is good practice to surround an entire set of alternates with tag markers:

{(FUNCTION)|(PROCEDURE)}\s+{[^\s(]+}
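
The shifting behavior can be emulated to see why the second form is safer. Python's re module numbers groups by position and does not shift them, so the sketch below mimics the shift by discarding groups that took no part in the match. The patterns are Python analogues, a literal space stands in for \s, and shifted_tags is a hypothetical helper, not part of NoteTab.

```python
import re

# Analogue of (FUNCTION)|({PROCEDURE})\s+{[^\s(]+}: only PROCEDURE is tagged.
shifting = re.compile(r"(?:FUNCTION|(PROCEDURE)) +([^ (]+)")

# Analogue of {(FUNCTION)|(PROCEDURE)}\s+{[^\s(]+}: the whole alternation is tagged.
stable = re.compile(r"(FUNCTION|PROCEDURE) +([^ (]+)")

def shifted_tags(pattern, line):
    """Number only the groups that actually matched, starting from 1."""
    m = pattern.search(line)
    if m is None:
        return {}
    taken = [g for g in m.groups() if g is not None]
    return dict(enumerate(taken, start=1))

print(shifted_tags(shifting, "FUNCTION Area(r)"))   # {1: 'Area'}
print(shifted_tags(shifting, "PROCEDURE Draw(x)"))  # {1: 'PROCEDURE', 2: 'Draw'}
print(shifted_tags(stable, "FUNCTION Area(r)"))     # {1: 'FUNCTION', 2: 'Area'}
```

With the alternation wrapped in its own tag, the name always lands in tag 2, whichever keyword was found.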

Replace-with Patterns

Replace regular expressions are constructed the same way as Find regular expressions, but the number of operators is reduced. The replacement process occurs in the following manner:

The Find expression finds a string of text that starts at the leftmost position in the input line that matches, and continues to the rightmost position that matches. The string of matched text is operated upon by the Replace expression.

Replace expressions are composed of the following:

Single character
A single character not otherwise endowed with special meaning is treated literally.

Backslash (\)
Followed by a single character, writes that character literally. For example, "\&" writes an ampersand, "\\" writes a backslash, and "\$" writes a dollar sign.

The following sequences have special meaning:
\s   space (ASCII #32)
\t   tab (ASCII #9)
\r   return (ASCII #13)
\l   linefeed (ASCII #10)
\n   newline (#13 followed by #10)
\f   formfeed (ASCII #12)
\z   null expression

Another special case occurs when "\" is followed by a single digit in the range of 1 through 9. In this case the tagged match word found by the Find expression is used in the resulting replacement text. If a tagged match word for that tag number was not defined, or if the tagged match word doesn't match anything, then nothing is output. The tagged match words can be used in any order and can be repeated any number of times.

An ampersand ("&") appearing in the Replace expression causes all text matched by the match expression to be sent to the output. The ampersand can appear in the Replace expression as many times as desired.
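
For comparison, the same replacement ideas exist in Python's re.sub, where \1 through \9 refer to groups and \g<0> plays the role of the ampersand. This is a sketch by analogy, not NoteTab syntax; the helper names are made up, and the third function relies on Python 3.5 or later, where a group that matched nothing is replaced by an empty string.

```python
import re

def swap_words(text):
    # Tags may be used in any order: write the second word, then the first.
    return re.sub(r"([a-z]+) ([a-z]+)", r"\2 \1", text)

def bracket_numbers(text):
    # \g<0> stands for the whole match, like & in a Replace expression.
    return re.sub(r"[0-9]+", r"<\g<0>>", text)

def optional_tag(text):
    # A tag that matched nothing contributes nothing to the output.
    return re.sub(r"(x)?y", r"[\1]", text)

print(swap_words("hello world"))      # world hello
print(bracket_numbers("a 12 b 345"))  # a <12> b <345>
print(optional_tag("y xy"))           # [] [x]
```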

Examples:
The following examples use the NoteTab Replace dialog box (press Ctrl+R to open it). Make sure you tick the "Regular Expression" checkbox before trying them.

Changes all H2 tags to H3:
Find: <H2>{.*}</H2>
Replace with: <H3>\1</H3>

Strips trailing blanks from each line:
Find: ^{.+[^\s]}\s*
Replace with: \1

Places each encountered word on a single line (Replace All can take a long time on big files!):
Find: \w*{['$#A-Z0-9]+}\w*
Replace with: \1\n

Converts all encountered e-mail addresses to HTML Mailto links:
Find: [A-Z_.\-0-9]+@[A-Z_.\-0-9]+
Replace with: <A href="mailto:&">&</a>
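
To experiment with these ideas outside NoteTab, three of the examples can be approximated in Python. This is a sketch under several assumptions: braces become parentheses, & becomes \g<0>, \s becomes a literal space, the hyphen inside the class is escaped, the negated class excludes the newline explicitly, and the e-mail search is made case-insensitive with re.IGNORECASE. The function names are hypothetical.

```python
import re

def h2_to_h3(text):
    return re.sub(r"<H2>(.*)</H2>", r"<H3>\1</H3>", text)

def strip_trailing_blanks(text):
    # re.MULTILINE lets ^ match at the start of every line.
    return re.sub(r"^(.+[^ \n]) *", r"\1", text, flags=re.MULTILINE)

def link_emails(text):
    return re.sub(
        r"[A-Z_.\-0-9]+@[A-Z_.\-0-9]+",
        r'<A href="mailto:\g<0>">\g<0></a>',
        text,
        flags=re.IGNORECASE,
    )

print(h2_to_h3("<H2>Title</H2>"))                    # <H3>Title</H3>
print(repr(strip_trailing_blanks("abc   \ndef \n")))  # 'abc\ndef\n'
print(link_emails("Write to joe.smith@example.com today"))
```

Like the Find pattern it imitates, the trailing-blank version needs at least two characters on a line before it strips anything.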